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SUMMARY 

Algorithms  which  take  advantage  of  massively  parallel  and  vector  architectures  are  formulated 
for  the  calculation  of  the  differences  of  times-of-arrival  derived  from  a  sequence  of  time  samples. 
The  application  which  generated  the  work  is  the  classification  of  pulse  trains  which  are  detected 
by  an  electronic  support  measures  receiver. 

Four  algorithms  are  proposed,  of  which  three  have  been  implemented  and  tested  on  several 
machines.  The  results  obtained  through  testing  are  compared  with  each  other  and  with  results  for 
more  conventional  architectures.  It  is  founa  that  one  of  the  machines,  the  massively  parallel 
AlASPAR  MP-1  computer,  is  able  to  perform  histogramming  fastest  through  exploitation  of  its 
data-parallel  architecture.  Several  suggestions  are  made  which  could  significantly  improve 
performance  on  ail  architectures. 
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1  INTRODUCTION 

Passive  detection  and  identification  of  radars  in  real  time  is  a  critical  operational  requirement 
for  Electronics  Support  Measures  (ESM)  equipment  such  as  radar  warning  receivers  (RWR’s). 
TypicaJly,  a  receiver  used  in  ESM  intercepts  signaJs  from  a  number  of  radars  and  other  emitters 
within  its  frequency  band  contemporaneously.  In  order  to  identify  individual  emitters  from  this 
composite  signal,  one  radar  parameter  which  is  often  derived  is  the  pulse  repetition  interval  (PRI) 
for  each  emitter,  i.e.  the  interval  between  the  times-of-arrival  (TOA’s)  of  consecutive  pulses  which 
it  emits. 

Other  parameters  which  may  be  measured  are  the  angle  of  arrival  (AOA),  frequency  and  amplitude 
to  form  a  set  of  pulse  descriptor  words  (PDW).  Utilising  the  AOA  data,  the  pulses  could  be  presorted 
into  clusters  of  adjacent  angular  sectors,  each  of  which  could  be  processed  independently.  However, 
the  situation  can  arise  in  which  it  is  impossible  to  distinguish  emitters  on  these  parameters  alone 
because  of  inadequate  resolution.  It  is  then  required  to  extract  as  much  information  as  possible 
from  the  TOA  data. 

In  a  dense  environment  comprising  many  types  of  emitters,  the  processing  of  this  information  is 
critically  time  intensive,  e.g.  for  real  scenarios  one  could  envisage  a  pulse  data  rate  of  10  million 
pulses  per  second.  Thus  for  real  time  operation,  any  improvement  in  the  speed  of  processing  is 
highly  desirable. 

Following  this  concept,  the  Information  and  Signal  Processing  Group  of  the  Electronic  Warfare 
Division  has  proposed  the  application  of  parallel  processor  technology  for  the  study  of  deinterleaving 
of  the  pulse  trains.  The  object  of  this  paper  is  to  outline  the  formulation  of  PRI  processing,  and  in 
particular  time-differencing  and  histogramming,  in  this  context.  The  paper  assumes  that  a  buffer 
of  TOA’s  is  continuously  filling  with  recorded  pulses,  and  the  buffer  data  is  periodically  processed 
(and  the  buffer  emptied)  by  the  processor(s). 

In  Section  2,  time-difference-of-arrival  (TDOA)  histogramming  is  introduced.  This  is  a  common 
processing  step  in  ESM  equipment  and  it  is  useful  because  it  highlights  periodicities  in  the  TOA 
data.  Some  of  the  properties  of  the  TDOA  histogram  are  discussed. 

Several  methods  of  performing  differencing  and  histogramming  on  massively  parallel  architectures 
are  proposed  in  Section  3.  The  method  of  implementation  on  a  two-dimensional  grid  of  processors  is 
discussed,  with  particular  reference  to  the  MASPAR  meissively  parallel  computer.  The  requirements 
in  time,  memory  and  number  of  processors  for  each  method  is  examined  as  a  function  of  the  number 
of  TOA’s  in  the  buffer. 

A  method  for  calculating  the  TDOA  histogram  on  a  vector  processor  architecture  is  discussed  in 
Section  4.  Again,  the  requirements  in  time  and  memory  are  examined  as  a  function  of  the  number 
of  TOA’s. 

In  Section  5,  the  histogramming  methods  for  massively  parallel,  vector  and  traditional  processors 
are  compared  and  timed.  A  MASPAR  MP-1  is  used  to  implement  the  massively  paradlel  methods, 
whereas  a  Fujitsu  VP-2200  is  used  to  implement  the  vector  processor  methods.  Sun  and  Hewlett- 
Packard  workstations  were  used  to  implement  histogramming  on  traditional  processors.  Timing 
comparisons  are  presented  for  various  histogram  bin  widths  and  pulse  densities  and  the  results  are 
reconciled  with  the  theory. 

Suggested  improvements  and  future  directions  are  outlined  in  Sections  6  and  7.  In  these  sections, 
several  suggestions  are  made  that  might  improve  the  performance  of  the  algorithms  with  respect 
to  their  speed  of  computation  and  areas  of  ongoing  research  are  highlighted. 
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Finally,  conclusions  are  drawn  in  Section  8  where  it  is  found  that  the  MASPAR  MP-1  performs 
the  fastest  histogram  of  the  architectures  examined.  However,  no  account  has  been  made  for  the 
input/output  time  requirements  of  the  machines  examined,  which  in  practice  could  significantly 
alter  their  performance. 


2  TDOA  HISTOGRAMMING 

Histogramming  is  one  of  the  most  widely  used  processing  operations  for  the  detection  of  pulsed 
emitters  [2,ft-9].  Let  the  TOA  data  for  ;V  pulses  from  the  PDW’s  be  arranged  in  order  of  arrival 
as  a  vector  t  where  ^ 

t  =  [  to  fi  ]  (1) 

where  Np  is  the  number  of  PDW’s  in  the  sample  and  U  <  t,  for  i  <  j. 

The  matrix  of  the  differences  in  times  of  arrivaJ,  A,  is  then  defined  by  its  element  A,j  where 

Aij  =  tj  —  ti 

where  0  <  i,j  <  ;Vp.  Note  that  A  is  antisynunetric. 

The  TDOA  histogram  is  the  vector  H  of  A*  (where  the  individual  h*  are  known  as  the  "histogram 
bins”)  of  dimension  Nh  such  that 

A*  =  #{Af»  <  <(k  +  1)1,,  0  $  ij  <  Np}, 

t,  is  the  histogram  bin  width,  and  fe  =  0, 1, .  ..,Nn  -  1  and  i^{'}  represents  the  cardinality  of  the 
set. 

The  attractive  property  of  the  TDOA  histogram  is  that  for  a  pulse  train  of  periodicity  T,  peaks  will 
appear  in  histogram  bins  A»  where  fe  =  [r»r/t,J  for  all  integers  n  >  0  where  [-J  returns  the  largest 
integer  smaller  thaui  its  operand.  Further,  interference  between  pulse  trains  tends  to  produce  a 
"flat”  noise  background.  That  is,  bins  which  do  not  contain  peaks  caused  by  pulse  trains  or  their 
harmonics  have  a  relatively  conslaat  j.iu,.Iitude.  It  is  the:  Tore  relatively  easy  to  visually  detect 
peaks  in  the  histogram  from  the  interference  noise  under  most  circumstances. 

Automatic  detection  and  estimation  methods  are  currently  topics  of  ongoing  research,  and  fall 
outside  the  scope  of  this  report  [l].  This  report  concerns  itself  only  with  the  calculation  of  the 
histogram. 

Further,  it  is  a  non-trivial  system  design  problem  to  decide  on  the  appropriate  buffer  length  for 
incoming  pulses  to  histogram,  the  appropriate  bin  width  for  the  histogram  and  how  to  then  combine 
or  interpret  the  results  of  consecutive  histograms  from  the  buffers.  Again,  this  is  considered  to  be 
outside  the  scope  of  this  report. 


3  DIFFERENCING  AND  HISTOGRAMMING  USING  THE  SIMD 

ARCHITECTURE 

The  model  considered  in  this  section  is  the  Single  Instruction  Multiple  Data  architecture  (SIMD). 
In  this  computational  model,  a  single  instruction  is  issued,  one  at  a  time,  to  the  complete  set 
of  processors.  Each  active  processor  performs  this  instruction  on  its  own  local  data.  This  is  in 
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contrast  to  the  Sintsle  Instruction  Single  Data  (SISD)  model  or  von  Neumann  model  (used  in  tradi¬ 
tional  computers  such  as  personal  computers  and  most  workstations)  and  the  Multiple  Instruction 
Multiple  Data  (MIMD)  model  in  which  each  processor  can  be  executing  separate  instructions  on 
separate  data. 

In  particular,  a  SIMD  model  is  examined  in  which  the  processors  we  arranged  in  a  two-dimensional 
matrix,  with  each  processor  element  able  to  communicate  data  with  any  other  processor  element, 
and  each  having  both  local  (private)  and  globally  shared  memory.  This  is  the  architecture  of  the 
MASPAR  computer,  on  which  numerical  simulations  were  performed  as  described  in  Section  5. 

Each  element  of  the  matrix  is  an  individual  processor,  denoted  /i.j ,  i  =  0 . - 1 ;  j  =  0, . . . ,  -V,  - 

1,  where  ;V,  is  the  number  of  columns  of  the  matrix  and  .Vj,  is  the  number  of  rows.  The  total 
number  of  processors  is  therefore  A'^  =  A'^.V,.  It  is  also  possible  to  "unwrap"  the  matrix  of 
processors  and  consider  the  processors  as  a  linear  array.  In  this  case  the  processors  can  be  denoted 
/!.,  1  =  0 . A'„  -  1. 

Three  methods  are  now  proposed  to  perform  differencing  of  TOA’s  on  this  SIMD  architecture. 
The  first  method  is  designed  to  calculate  the  matrix  of  time  differences,  whereas  the  latter  two  are 
designed  for  histogramming. 


3.1  Method  1 


This  method  aims  to  produce  the  TOA  difference  matrix,  A.  It  does  not  calculate  the  TDO.A 
histogram,  but  it  is  a  straightforward  operation  to  collate  the  matrix  into  the  histogram.  It 
is  a  nearest  neighbour  algorithm,  which  gives  a  speed  advantage  in  large  processor  matrices. 
Furthermore,  the  use  of  communication  along  diagonals  means  that  it  is  well-suited  for  future 
implementation  on  the  MASPAR  computer,  whose  architecture  is  described  in  Section  5.1. 

The  requirement  for  this  method  is  that  the  processor  matrix  si^e  be  jV^  x  N^. 

First,  the  times  of  arrival,  f,,  are  stored  in  the  first  row  of  the  processor  matrix,  fio.,,  >  - 

0 _ ,.Vp  -  1.  Next,  the  first  differences,  are  calculated  and  stored  in  the  first  column 

of  the  matrix,  p,+i,o.  i  =  0, . . . ,  iVp  -  2  as  shown  in  Figure  1(a).  The  next  step  is  to  add  the 
contents  of  the  adjacent  elements  of  the  first  processor  column  to  obtain  the  A.  ,+2  since 

Ai,i>i  -h  Ai+1,,+2  =  A;  ,+2- 

The  results  are  then  stored  in  the  adjacent  processor  to  the  right,  i.e.  the  second  column.  The 
contents  of  the  processor  matrix  are  then  as  shown  in  Figure  1(b).  The  remaining  elements 
can  be  computed  using  the  following  identity: 

Al  t  =  A|_t_i  -I-  Atil  t  —  A'  i  t_i, 

where  0  <  /  <  Jt  <  Np  This  differencing  operation  is  illustrated  in  Figure  1(c).  This  step  is 
repeated  until  all  elements  of  the  A  matrix  have  been  calculated.  The  configuration  of  the 
matrix  at  the  end  of  the  matrix  is  shown  in  Figure  1(d). 

The  operations  which  calculate  the  values  for  each  new  column  in  the  matrix  are  independent 
of  one  another,  and  therefore  can  be  performed  in  parallel.  For  each  element  of  the  column, 
the  number  of  operations  is  fixed.  The  number  of  sequential  steps  that  must  be  performed 
is  therefore  determined  by  the  number  of  columns,  Np.  Hence,  this  method  requires  O(.Vp) 
steps  to  complete.  The  number  of  processors  required  is  O^ATp),  and,  because  only  one 
variable  needs  to  be  stored  on  each  processor,  the  storage  requirements  per  processor  is  0(1). 
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(d)  Method  for  calculating  higher  order  differ¬ 
ences. 


Figure  1  Method  1  for  SIMD  Architectures 


3.2  Method  2 


A  method  is  presented  here  which  calculates  the  TDOA  histogram,  instead  of  the  TO.A 
matrix.  The  processors  are  viewed  as  a  1-dimensional  array  rather  than  as  a  matrix.  The 
times  of  arrival  (j  are  mapped  onto  the  processors  Hi  so  that  each  processor  holds  one  time  of 
arrival.  Each  processor  also  holds  one  histogram  bin,  i.e.  h,  is  stored  on  /i,-.  This  is  illustrated 
in  Figure  2(a). 

For  the  first  step  of  the  algorithm,  each  processor  passes  —  ■‘shifts"  —  a  copy  of  its  TOA  to 
its  neighbour  on  the  right,  i.e.  tj  and  are  now  stored  on  /i,.  The  differences,  di,  of  the 
pairs  are  now  calculated  so  that  the  /i,  now  also  contain  A,_|  i,  t  >  0.  Note  that  /Jn  becomes 
inactive  since  t_i  is  not  recorded.  The  histogram  bin  to  which  the  difference  belongs  is  given 
by  the  expression 

Pi  =  Ld./riJ.  (2) 
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(a)  Initiation  of  the  hiatogram  with  hrst  order  dif- 
ferences  using  Method  d  for  ,Vp  =  5. 


(b)  Updating  the  histogram  with  second  order  dif¬ 
ferences  using  Method  2  for  ,Vp  =  5. 

Figure  2  Method  2  for  SIMD  Architectures 


The  processor  which  stores  the  relevant  histogram  bin  for  the  difference  stored  in  pi  is  there¬ 
fore  ftp, .  Hence,  ft,  sends  a  signal  to  ftp.  so  that  the  bin  is  incremented.  However,  if  p,  >  Ap 
then  no  processor  is  responsible  for  the  histogram  bin  and  no  processor  is  signalled.  The 
second  order  differences  are  then  computed  by  again  passing  the  copy  of  the  received  TOA 
on  to  its  neighbour  so  that  pj, »  >  1,  now  holds  f,  and  f,_2.  Both  po  and  pi  are  now  inactive. 
Again,  the  difference  of  the  f’s  are  calculated  and  the  appropriate  histogram  bin  signalled  (if 
in  range).  This  is  shown  in  Figure  2(b). 

The  action  of  shifting,  differencing  and  signalling  is  repeated  in  that  order  until  all  processors 
are  inactive  (i.e.  all  differences  have  been  calculated),  or  until  a  stage  is  reached  where  all 
the  Pi  are  out  of  range.  If  the  p;  are  all  out  of  range,  then  no  changes  will  be  made  to 
the  computed  histogram  in  current  or  future  iterations.  Once  one  of  these  criteria  has  been 
satisfied,  the  histogram  has  been  formed. 

The  operations  in  each  processor  are  independent  of  the  others,  and  so  O(lVp)  sequential 
steps  are  required  for  this  method,  assuming  that  all  differences  are  calculated  and  that  the 
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signalling  between  processors  can  be  performed  entirely  in  parallel.  The  storage  requirements 
are  0(  1 )  per  processor  since  only  one  TOA  and  one  histogram  bin  need  to  be  stored  locally 
on  a  processor. 

When  implementing  this  method  on  the  MASPAR,  consideration  must  be  made  of  the  fact 
that  the  signalling  required  by  this  \fethod  cannot  be  performed  entirely  in  paraUd  since  a 
processor  can  only  accept  one  signal  at  a  time.  In  the  worst  case,  in  which  a  single  periodic 
pulse  train  is  present  in  the  data,  each  processor  will  wish  to  contact  the  same  histogram 
bin/ processor.  Hence,  the  time  requirement  for  signalling  is  then  0{Np),  and  the  overall  time 
required  is 

Note  that  it  is  to  be  expected  that  this  method  would  be  sensitive  to  bin  width  because  the 
variable  number  of  differences  which  must  be  formed  until  the  histogram  is  complete  for  a 
given  pulse  density  (or  mean  pulse  arrival  rate).  When  the  bin  width  is  narrow,  or  the  mean 
pulse  rate  is  low,  the  number  of  orders  of  differences  required  is  small  because,  after  only  a 
few  orders,  the  differences  are  large  enough  that  none  fall  within  the  range  of  the  histogram. 
A  roughly  linear  relation  between  time  required  and  histogram  bin  width  is  expected  because 
of  the  roughly  linear  relation  between  number  of  orders  of  differences  required  and  histogram 
bin  width.  Similarly,  a  linear  relation  is  expected  between  time  required  and  mean  pulse  rate. 


3.3  Method  3 


The  method  of  forming  the  histogram  in  parallel  discussed  above  makes  extensive  use  of  a 
global  communications  system  to  notify  processors  when  a  histogram  bin  is  to  be  incremented. 
On  the  MASPAR  computer,  such  global  communications  are  considered  a  precious  resource. 
Local  communication  is  much  to  be  preferred. 

A  second  problem  with  the  previous  method  is  that  there  is  significant  latency  in  the  pro¬ 
cessors  if  all  differences  are  to  be  calculated.  In  fact,  on  average,  half  of  the  processors  are 
inactive. 

To  avoid  global  communication  as  much  as  possible,  a  partial  histogram  is  stored  in  a  complete 
set  of  bins  on  each  processor,  and  merged  in  an  orderly  fashion  after  all  differencing  has  been 
performed.  The  partial  histograms  are  denoted  hlj,  0  ^  i,j  <  Np,  where  the  first  index 
denotes  the  processor  number  on  which  it  is  stored,  and  the  second  index  is  the  histogram 
bin  within  the  processor.  To  maximise  processor  utilisation,  the  TOAs  are  “rotated”  through 
the  processor  array  rather  tham  shifted.  By  “rotated”  it  is  meant  that  the  data  are  passed 
as  previously  described,  except  that  the  last  processor  in  the  array  also  passes  its  TOA  data 
to  the  first  element  in  the  array,  as  illustrated  in  Figure  3.  Hence,  after  the  first  rotation,  fii 
holds  t(  and  t,_i,  for  «  >  0,  and  Ho  holds  to  and 

The  method  proceeds  by  first  rotating  the  data.  The  absolute  differences  di  are  then  formed. 
Hence,  fit  holds  for  «  >  0  and  /lo  holds  Ao,y,_i.  The  index  into  each  processor’s 

own  private  partial  histogram  is  pi  from  (2).  Thus,  is  incremented.  The  rotation  and 
differencing  steps  are  performed  again  so  that  p,-  now  holds  t,-,  f,_j  and  A,_2.,-,  for  i  >  1,  and 
fi+jv,-3,  and  A(,i+jv,-3  for  «  =  0,1.  The  index  is  calculated  and  the  appropriate  partial 
histogram  bin  is  incremented  as  before  if  it  is  within  the  range  of  bins  stored. 

These  steps  are  repeated  until  jVp/2  rotations  have  been  performed,  assuming  jVp  is  an  even 
number.  After  the  final  rotation,  each  processor  pi  contains  tj  and  tj+w,/3)  where  j  =  for 
i  <  lVp/2,  or  j  =  »  -  jVp/2  otherwise.  Hence,  Pi  and  Pi+N,/2  contain  the  same  pair  of  TOAs. 
For  this  reason,  on  the  last  step,  half  of  the  processors  are  made  inactive  to  avoid  duplicate 
differences.  However,  for  those  processors  remaining  active  the  final  step  proceeds  as  before. 
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R)  ^^2  f^3  ^^4 


Figure  3  Calculation  of  the  partial  histograms  of  first  order  differences  using  Method  3 
for  iVp  =  5. 

It  now  remains  to  merge  the  partial  histograms  into  a  complete  histogram.  This  is  performed 
in  a  two-dimensional  manner,  i.e.  it  makes  use  of  the  matrix  layout  of  the  processors.  More 
particularly,  it  makes  use  of  the  assumption  that  the  processors  are  leiid  out  as  a  square 
matrix  and  that  the  number  of  processors  is  an  integer  power  of  4.  However,  the  method  can 
be  generalised  to  non-square  matrices  and  to  numbers  of  processors  which  are  not  integer 
powers  of  4,  although  the  generalisations  are  not  presented  here. 

The  merging  process  proceeds  by  gathering  and  distributing  fewer  and  fewer  partial  results 
further  and  further  afield,  until  the  partial  results  sum  to  give  the  complete  histogram.  The 
algorithm  for  merging  proceeds  as  follows: 

i.  Let  all  the  partial  histogram  bins  on  each  processor  be  declared  “current.” 

ii.  Let  the  distance  to  neighbours,  c,,  equal  1. 

iii.  Each  processor  chooses  a  quarter  of  its  partial  bins  to  remain  current  in  a  way 
determined  by  its  processor  number. 

iv.  The  remainder  of  the  bins  which  were  previously  current  are  distributed  to  the  three 
neighbours  who  aie  a  distance  e,  away  in  the  eastern,  southern  and  south-eastern 
directions. 

V.  Simultaneously,  each  processor’s  current  bins  are  updated  with  information  received 
(or  fetched)  from  neighbours  who  are  a  distance  «,  away  in  the  western,  northern 
and  north-western  directions. 

vi.  If  the  number  of  current  bins  per  processor  equals  one,  then  the  merge  has  finished 
amd  the  set  of  current  bins  from  each  processor  constitutes  the  complete  histogram. 

vii.  Otherwise,  double  the  distance  to  neighbours,  f,,  and  repeat  from  step  iii. 
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The  algorithm  is  now  explained  in  greater  detail.  For  each  step  «,  s  =  1, . .  ..log^  .V,,  the 
partial  histogram  on  each  processor  is  updated  thus 


0, 


when  i  mod  2c,  =  k  mod  2e,, 
j  mod  2e,  =  1  mod  2c,, 

otherwise, 


(3) 


where  h\j  i,  ,  —  h|,„,oaN,)/v,+(jino<iAf,),tA(,+i»  for  all  0  <  i,k  <  .V,,  0  $  j,l  <  jV,  and  c,  —  2*"‘. 
After  ail  steps  have  been  completed,  bj =  A,  and  the  histogram  is  complete. 

For  step  s  of  the  mer^ng  process,  each  process  must  fetch  3iVp/4cJ  data  elements  over  a  matrix 
distance  of  c,  processors:  horizontally,  vertically,  and  diagonally.  If  a  cost  of  tj  =  0(c,)  is 
assigned  to  fetching  data  over  a  distance  c,  (as  it  is  on  the  MASPAR),  then  the  total  time 
requirement  for  the  merging  process  is 

"K'-S.V.r, 


3N,  0(2*) 

4  h  4' 


-  ^  E  0(2”) 


=  0{N,). 


Hence,  the  time  required  for  mer^ng  is  not  dependent  on  the  form  of  the  data  as  it  is  in 
Method  2,  but  on  the  size  of  the  data.  This  may  be  an  advantage  in  system  design,  since  the 
time  required  to  compute  the  histogram  can  be  predicted  with  a  high  degree  of  accuracy. 


An  example  of  the  merging  algorithm  for  the  case  of  a  small  processor  matrix  (iVp  =  16, 
■=  Ny  =  4)  is  shown  in  Figure  4.  The  figure  illustrates  the  two  merging  steps  which  are 
required  to  merge  the  partial  histograms  into  a  complete  histogram.  The  pairs  of  numbers 
displaced  in  each  processor  circle  represent  the  i,j  pairs  of  the  which  are  to  be  fetched 
&om  other  processors.  The  arrows  indicate  from  which  processors  the  data  is  being  sent. 
Dashed  arrows  indicate  that  the  toroidal  nature  of  the  matrix  is  being  exploited. 


Both  the  differencing  and  merging  processes  require  O(Afp)  steps  to  complete  and  so  Method  3 
requires  O(tfp)  steps.  The  storage  requirements  per  processor  are  now  0{Np)  also  since  a 
partial  histogram  is  stored  on  each  processor.  However,  the  linear  relation  between  pulse 
rates  or  bin  widths  and  time  required  is  no  longer  expected  as  it  was  for  Method  2.  Rather, 
it  is  to  be  expected  that  the  algorithm  would  require  a  fixed  amount  of  time,  depending  only 
on  the  number  of  pulses  in  the  buffer,  because  all  loops  in  the  algorithm  iterate  for  a  fixed 
number  of  times,  independent  of  bin  widths  and  pulse  arrival  rates. 


3.4  Comparison  of  Resource  Requirements 

It  is  possible  to  compare  the  methods  described  above  in  terms  of  their  usage  of  important 
computing  resources,  such  as  time,  processors  and  memory.  Such  a  comparison  is  presented 
in  Table  1,  where  time,  processors  and  local  and  global  memory  have  been  listed  as  functions 
of  the  number  of  data  points  in  order  notation.  Note  that  the  stated  requirements  for  local 
memory  are  per  processor. 
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(a)  First  merge.  (b)  Second  and  final  merge. 


Figure  4  Example  of  the  merging  algorithm  of  Method  3  for  a  4  x  4  processor  grid. 
Table  1  Comparison  of  resource  requirements  for  SIMD  methods. 


Method 

Time 

Processors 

Local  mem. 

Global  mem. 

Method  1 

0(Ap) 

o{n^) 

0(1) 

0(1) 

Method  2 

o(fv;) 

0{N,) 

0(1) 

0(1) 

Method  3 

0{N,) 

0{N,) 

0(iV,) 

0(1) 

4  HISTOGRAMMING  USING  THE  VECTOR  PROCESSOR  ARCHITECTURE 

The  vector  processor  model  is  in  some  ways  similar  to  the  SIMD  model.  A  vector  processor  typically 
consists  of  a  number  of  vector  “units”.  Each  of  these  units  is  roughly  analogous  to  a  simple, 
specialised  SIMD  processor  array.  Within  the  vector  unit,  a  simple  hard-coded  operation,  specific 
to  the  unit,  is  performed  on  an  array  or  vector  of  data,  either  in  a  pipeline  or  in  parallel  or  both. 
Typically,  vector  units  are  built  for  vector  addition,  subtraction,  multiplication  and  division  and 
operations  which  cannot  make  use  of  the  specialised  vector  units  are  performed  serially  according 
to  the  usual  von  Neumann  SISD  model.  The  economies  of  scale  which  are  achieved  by  using  vector 
units  over  a  standard  arithmetic  logic  unit  (ALU)  are  very  significant,  and  this  is  why  they  were 
and  are  still  popular  with  manufacturers  of  supercomputers. 

The  method  of  performing  histogramming  here  is  somewhat  naiVe.  The  vectors  u,-,  i  =  0, . . . ,  jVp/2  -  1 
are  formed,  where  u,-  is  the  vector  t  of  (1)  rotated  •  places.  Hence 

Uo  =  [  to  ti  tj  ■  •  •  (at,-!  j  =  f 
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The  algorithm  proceeds  from  step  s  =  1, . . JV,/2  -  1.  At  each  step  s,  the  operation 

a.  =  Iv,  -  Vol 

is  performed,  where  a,  is  a  vector  of  absolute  time  differences  and  in  this  case  |  '|  is  the  absolute  value 
of  each  element  of  the  vector.  This  operation  can  be  efficiently  performed  in  a  vector  difference  unit. 
The  method  of  rotation  and  absolute  differencing  is  analogous  to  that  performed  in  Method  3  in 
Section  3.3.  It  results  in  all  of  the  A,,  being  calculated  except  for  the  Aij+n^p.  As  with  Method  3 
above,  a  final  special  step,  s  =  Np/2,  must  be  performed  in  which  two  vectors  of  rank  iVp/2  are 
formed  and  differenced  so  that 

aN,/2  =  |row(t,iVp/2,iVy/2)  -row(t,0,AV/2)|, 

where  row(2,  i,j)  is  the  vector  formed  from  j  rows  of  vector  x  starting  from  row  i. 

At  the  end  of  each  step,  the  elements  of  the  vector  o,  are  sorted  into  the  appropriate  histogram  bins. 
The  division  which  is  required  to  calculate  the  appropriate  bins  for  each  element  can  make  use  of 
a  vector  division  unit.  On  the  Fujitsu  VP-2200  (which  was  used  to  implement  this  algorithm),  the 
vector  difference  and  division  units  were  the  only  two  used.  All  other  operations  were  performed  in 
the  traditional  serial  manner.  An  illustration  of  the  implementation  of  the  algorithm  on  a  vector 
processor  with  AT,  =  5  is  shown  in  Figure  5. 


Vector  Vector  Main 

Oiffeienca  Ohriaaon  Memory 

Unit  Unit 

Figure  5  Implementation  of  paratUel  bistogramming  on  a  vector  processor. 

The  time  required  for  vector  processing  in  the  way  described  above  requires  computational 

steps  and  the  storage  requirements  are  O(Afp)  +  0{Nn)  (since  it  is  not  necessary  for  this  algorithm 
to  have  Af,  =  jV^). 
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5  NUMERICAL  SIMULATIONS 

Numerical  simulations  were  performed  on  several  types  of  computers  using  the  algorithms  described 
previously.  A  most  important  feature  of  deinterleaving  from  an  operational  perspective  is  the  time 
required  for  deinterleaving  a  given  number  of  points.  Throughout  the  fcdlowing  discussions,  sets 
of  data  points  containing  the  times  of  arrival  of  40%  pulses  in  double  precision  real  numbers  were 
histogrammed.  This  number  of  pulses  was  chosen  because  it  matched  the  number  of  processors  on 
the  MASPAR  computer.  The  time  required  to  perform  the  histogram  only  was  measured.  Hence, 
the  measurements  take  no  account  of  the  time  required  to  read  the  TOA’s  off  disk  or  to  display 
or  otherwise  process  the  results  of  the  histogram  subsequently.  While  these  times  may  indeed  be 
critical  to  the  performance  of  a  deinterleaving  system,  it  is  outside  the  scope  of  this  report. 


5.1  The  MASPAR  Architecture 


The  MASPAR  MP-1  consists  of  two  major  functional  parts:  the  Array  Control  Unit  (ACU), 
and  the  Processor  Clements  (PE’s)  [3,5].  The  ACU  has  at  its  core  a  conventional  SISD 
RISC  (Reduced  Instruction  Set  Computer)  microprocessor  which  coordinates  the  actions  of 
the  PE’s.  The  ACU  can  communicate  both  with  the  PE’s  or  with  its  host  computer  (in 
this  case  a  DECstation  5000)  via  VME  bus.  The  PE’s  themselves  can  communicate  with 
the  ACU  and  with  each  other.  Two  methods  exist  for  communication  between  PE’s,  and 
these  correspond  with  the  methods  described  in  Section  3.  The  first  is  a  nearest  neighbour 
mechanism,  through  which  each  PE  can  communicate  with  its  nearest  neighbour  around  the 
four  major  compass  points  and  around  the  four  minor  compass  points.  This  mechsinism  is 
called  the  “Xnet.”  It  can  also  be  used  for  communication  between  processors  over  a  specified 
distance  along  the  major  and  minor  compass  points.  The  second  mechanism  is  a  means  by 
which  any  PE  can  communicate  with  any  other  PE,  rather  like  a  telephone  network.  This 
mechanism  is  known  as  the  “Global  Router.”  However,  the  MASPAR  PE’s  are  grouped  into 
blocks  of  16  and  each  block  has  just  one  router  line.  Hence,  each  PE  within  the  block  must 
compete  for  use  of  the  “party  line.”  Additionally,  this  is  a  slower  method  of  communication 
than  the  Xnet,  and  so  Xnet  is  to  be  preferred  where  convenient.  A  simplified  system  diagram 
of  the  MASPAR  is  shown  in  Figure  6. 


Figure  6  System  overview  of  the  MASPAR  MP-1  computer. 
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5.2  Implementation  on  the  MASPAR 

Methods  2  &  3  of  the  SIMD  algorithms  were  implemented  on  a  MASPAR  computer  in  MPL 
(MASPAR  Programming  Language)  [4].  The  MASPAR  which  was  used  for  the  numerical 
simulations  has  40%  individual  processors  (Nf  =  40%),  laid  out  as  a  64  x  64  two-dimenuonal 
grid.  Method  2  makes  use  of  both  the  Xnet  for  the  data  shifting  and  the  global  router  for 
incrementing  histogram  bins.  Method  3  makes  use  M  the  Xnet  only,  both  for  data  rotation  and 
the  final  histogram  merge.  Listings  of  the  MPL  code  are  contained  in  Appendix  Sections  1.1- 
1.3. 

The  two  methods  were  trialled  for  two  pulse  data  sets,  and  for  various  histogram  bin  widths. 
Each  pulse  data  set  contains  40%  pulses.  The  first  data  set  contained  20  separate  periodic 
emitters  with  PRI’s  ranging  from  10.1  to  19.3  arbitrary  time  units,  with  an  average  PRI  of 
14.91  units  and  a  mean  pulse  rate  of  one  pulse  per  0.7161  units.  The  second  data  set  contains 
a  single  periodic  emitter  with  a  PRI  of  13  units.  Histogram  bin  widths  were  varied  between 
0.01  units  and  0.2  units 

The  times  required  to  compute  the  TDOA  histograms  for  each  method  were  measured,  and 
these  are  plotted  in  Figure  7(a).  FVom  the  plot,  it  can  be  seen  that  Method  2  is  sensitive 


0  I— i . .  . . .  i  5.  ;  U  T  -  -t 

0.00  0.05  O.tO  0.t5  0.20 


Bin  width 

(a)  Computatioa  time  vs.  hiatogiam  bin  width  for  MASPAR 
simulations. 


Bin  width 

(b)  Computation  time  vs.  histogram  width  for  Fujitsu  rimn- 
latbns. 


Figure  7  Results  of  numerical  simulations. 

to  both  bin  width  and  mean  pulse  rate,  whereas  Method  3  is  completdy  insensitive  to  both 
of  these  factors,  as  predicted  in  Sections  Section  3.2  and  Section  3.3.  That  is.  Method  2  is 
sensitive  to  bin  width  and  mean  pulse  rate  because  its  main  loop  terminates  as  soon  as  the 
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minimum  difference  at  that  order  is  iarger  than  the  maximum  PRI  being  recorded  in  the 
histogram.  Method  3  is  insensitive  to  both  factors  since  the  complete  set  of  differences  is 
always  calculated.  It  is  also  found  that  the  histogram  merging  operation  offers  substantial 
time  savings  over  contacting  processors  individually  after  each  difference  operation  (as  used  in 
Method  2)  if  the  full  set  of  differences  is  calculated.  The  time  required  for  histogram  merging 
in  Method  3  is  0.2943  seconds  whereas  Method  2  requires  0.0054  seconds  for  interprocesscv 
communication  after  each  round  of  differencing.  Hence  if  more  than  about  the  54th  order  of 
differences  is  required,  the  communication  time  required  for  Method  3  will  be  less  than  that 
required  by  Method  2. 


5.3  The  Fujitsu  VP-2200  Architecture 

The  Fujitsu  VP-2200  is  a  traditional  vector  processor.  It  consists  of  several  specialised 
processing  units,  along  with  a  general  von  Neumann-type  SISD  processor. 


5.4  Implementation  on  the  Fujitsu  VP-2200  Architecture 

The  method  as  outlined  in  Section  4  was  implemented  in  Fortran  on  the  Fujitsu  VP-2200.  It 
was  compiled  using  all  vectorising  options  turned  on  to  maximise  the  use  of  the  vector  units. 
A  listing  of  the  Fortran  subroutine  which  was  used  to  calculate  the  histogram  is  , (resented  in 
Section  1.4. 

The  method  was  tested  for  both  data  sets  described  in  Section  5.2.  As  for  the  MASPAR, 
the  times  required  for  histogramming  were  measured  and  they  are  graphed  in  Figure  7(b). 
Again,  there  is  a  linear  relation  between  binwidth  and  time  required,  and  the  relationship 
is  also  dependent  on  mean  pulse  repetition  frequency.  The  dependency  on  both  binwidth 
and  mean  PRF  is  much  reduced,  since  all  differences  are  calculated,  regardless  of  the  form 
of  the  data.  The  dependency  occurs  in  the  step  where  differences  are  sorted  into  bins.  If  the 
histogram  array  index  is  out  of  bounds,  no  increment  can  be  performed.  In  situations  where 
the  mean  PRF  is  low  compared  with  the  binwidth,  the  indices  will  all  be  out  of  bounds  after 
a  few  iterations,  so  the  histogram  will  be  formed  in  slightly  less  time. 


5.5  Compairisons  Between  Implementations 


A  comparison  between  architectures  and  algorithms  can  be  made  by  selecting  a  certain  set  of 
parameters,  running  the  algorithms  on  each  of  the  architectures  with  these  parameters,  and 
measuring  the  time  required  to  compute  the  histogram.  This  was  done  for  the  MASPAR  MP-1 
for  Methods  2  and  3, 2uid  on  the  Etyitsu  VP-2200,  Hewlett-Packard  Series  9000  Models  720 
and  730  workstation,  and  on  the  Sun  SPARC  2  workstation  using  the  Fbrtram  algorithm.  The 
parameters  used  in  the  tests  were  a  bin  width  of  0.1  units  and  the  first  data  set  (consisting 
of  4096  pulses  from  20  periodic  emitters  with  mean  PRI  of  0.7161  units). 

The  following  table  summarises  the  results  obtained.  The  results  show  that  the  best  algo¬ 
rithm  and  architecture  in  this  case  is  the  MASPAR  using  Method  3.  The  rate  of  calculation 
corresponds  to  approximately  4000  pulses  per  second. 
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Table  2  Comparisons  between  architectures  and  algorithms. 


Architecture 

Algorithm/Language 

Time  (s) 

MASPAR  MP-1 

Method  2/MPL 

3.09 

Method  3/MPL 

0.98 

FNijitsu  VP-2200 

Vectorised  Fortran 

1.31 

HP  720 

Optimised  Fortran 

7.20 

HP  730 

Optimised  Fortran 

5.44 

Sun  SPARC  10 

Optimised  Fortran 

11.70 

Sun  SPARC  2 

Optimised  Fortran 

33.77 

6  SUGGESTED  IMPROVEMENTS 

A  number  of  changes  could  be  made  to  the  algorithms  to  improve  the  speed  of  calculation  of  the 
histogram.  Improvements  which  would  benefit  aJl  algorithms  and  architectures  are; 


i.  use  of  fixed  point  (integer)  rather  than  double  precision  floating  point; 

ii.  expressing  all  times  of  arrival  in  binwidth  units.  That  is,  before  any  diflerencing  takes  place, 
all  of  the  TOAs  would  be  divided  by  the  binwidth.  If  this  improvement  were  applied,  the 
integer  part  of  the  difference  would  become  the  index  of  the  histogram  bin.  Also,  the  number 
of  division®  '•^'quircd  would  now  be  0(Np)  rather  than  O(jV^),  although  the  total  number  of 
operations  would  still  be 


The  adoption  of  these  improvements  on  an  HP  730  gives  a  measured  total  time  of  1.19  seconds, 
when  tested  under  the  conditions  used  in  Section  5.5.  If  the  greater  than  four  times  speed  up  can 
be  transferred  to  the  MASPAR  (a  reasonable  assumption),  then  a  histogram  calculation  rate  of 
nearly  20,000  pulses  per  second  may  be  possible. 

Specific  to  the  MASPAR,  the  following  changes  might  further  increase  the  processing  rate: 


i.  storage  of  the  entire  TOA  records  in  each  processor  element,  rather  than  merdy  storing  two 
TOAs,  which  would  reduce  some  of  the  inter-PE  communication  overhead; 

ii.  pipelining  the  histogram  calculations.  For  instance,  the  PE  array  could  be  divided  into  four 
groups.  Each  group  would  work  at  a  different  stage  in  the  calculation  process. 


Note  that,  because  the  histogram  is  an  process  (0(Np)  for  the  MASPAR),  reducing  the 

number  of  pulses  to  be  histogrammed  by  a  factor  of  two  results  in  a  speed  increase  of  up  to  four 
times.  Farther,  it  may  not  be  necessary  to  calculate  all  the  orders  of  difference.  Reducing  the 
number  of  orders  of  difference  to  a  fixed  number,  potentially  reduces  the  order  of  the  process  from 
to  O(JVp).  For  the  examples  used  in  this  report  in  which  4096  pulses  have  been  used,  then 
limiting  the  calculation  of  the  histogram  up  to  the  200th  order  of  difference  may  yidd  a  speed 
improvement  of  around  10  times.  However,  the  more  crowded  the  pulse  environment,  the  more 
orders  will  be  required  to  identify  a  sufficient  number  of  harmonics  for  detection. 
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7  FUTURE  DIRECTIONS 

Many  improvements  have  been  discussed  which  have  not  yet  been  rigorously  implemented  and 
tested.  It  is  planned  that  some  or  all  of  these  will  be  tested.  Further,  there  are  some  architectures 
which  have  not  been  examined  and  which  deserve  attention.  Notable  amongst  these  are  the  hyper¬ 
cube  SIMD  architecture  employed  by  the  Connection  Machine,  and  MIMD  architectures  such  as 
transputer  arrays. 

The  related  question  of  how  to  perform  detection  of  PRTs  from  the  histograms  once  they  have  been 
computed  is  an  important  topic  for  further  research.  In  particular,  the  use  of  neural  networks  in 
detection,  and  how  to  present  TOA  data  to  a  neural  network  needs  to  be  investigated. 

It  is  hoped  that  Method  1.  discussed  in  Section  3.1,  being  a  neairest  neighbour  algorithm,  will  be 
suitable  for  ready  implementation  on  a  hypercube  architecture,  and  with  further  work,  may  be 
useful  in  neural  network  investigations. 


8  CONCLUSIONS 

Four  methods  have  been  presented  for  calculating  the  time  of  arrival  histogram  for  a  pulse  data 
set.  Two  architectures  have  been  considered:  a  SIMD  parallel  processor  (the  MASPAR  MP-1)  and 
a  vector  processor  (the  Fujitsu  VP-2200).  Three  of  the  algorithms  and  each  of  the  architectures 
were  tested  under  various  conditions.  Comparisons  were  mtule  between  the  described  algorithms 
and  architectures  and  with  more  well-known  architectures  such  as  Hewlett-Packard  and  Sun  work¬ 
stations.  One  algorithm  (Method  1)  was  considered  unsuitable  for  testing  and  comparison  because 
it  does  not  calculate  a  histogram. 

It  was  found  that  the  MASPAR  was  able  to  calculate  the  histogram  fastest.  The  speed  of  the 
algorithms  on  all  architectures  was  limited  by  the  use  of  double  precision  arithmetic  and  the  large 
number  of  pulses  used  in  the  processing.  Given  these  restrictions,  and  given  also  that  no  account  of 
the  time  required  for  input  and  output  of  the  data,  a  maximum  processing  rate  of  about  4000  pulses 
per  second  was  achieved. 

Several  improvements  were  suggested.  The  incorporation  of  these  improvements  into  algorithms 
on  the  MASPAR  may  lead  to  a  calculation  rate  for  the  full  histogram  of  around  20,000  pulses  per 
second.  With  the  recent  release  of  the  new,  faster  MASPAR  MP-2,  even  this  number  could  well  be 
exceeded.  However,  this  has  yet  to  be  confirmed. 
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APPENDIX  I 


PROGRAM  LISTINGS 

The  following  are  listings  of  the  code  used  in  generating  the  benchmarks  in  Section  5  of  this  report. 
Only  the  subroutines  which  generate  the  histogram  are  reproduced  here:  the  main  body  of  the 
programs  which  include  reading  in  the  TOA  data  from  files  and  decoding  command  line  options, 
and  the  code  for  timing  and  profiling,  has  been  omitted. 

Sections  1. 1-1.3  are  listings  of  the  MPL  code  used  on  the  .VfASPAR.  They  consist  of  two  versions 
of  hint. m  as  well  as  a  header  file,  vlcjipX.h. 

Section  I.-4  is  a  listing  of  the  Fortran  subroutine  calcHist  used  both  on  the  Fujitsu  \'P-2200  vector 
processor  and  for  comparison  on  the  Hewlett-Packard  720  and  730  and  Sun  SPARC  2  workstations. 


I.l  SIMD  Method  2  —  hist.m 


1  • include  <mpl.h> 

2  tinclude  <math.h> 

3  tinclude  <sendvith.h> 

4  tinclude  "vlc.mpl.h" 


5  plural  bool 

6  HistogramCtoa,  PRIapacing) 


7  regiater  plural  double 

8  regiater  double 

9  { 

10  regiater  int 

11  regiater  plural  bool 

12  register  plural  procno 

13  register  plural  procno 

14  register  plural  double 

15  register  double 

16  register  plural  bool 

17  /•  Pre:  the  toa  are  in 


toa; 

PRIapacing; 

order  =  0; 
result ; 
hist.bin; 

my .hist  =  0,  to.send; 
nezt.toa,  pass.toa,  diff; 
timeRange; 

destInRange  =  TRUE,  activeSet; 
ascending  order  sith  iproc  */ 


18  timeRange  »  proc[nproc-l] .toa  -  procCO] .toa; 

19  pass.toa  =  toa; 

20  hist.bin  =  iproc; 

21  to.send  =  0; 

22  vhlle  ((■•■+order  <  nproc)  **  (iproc  >•  order)  tk  destInRange)  { 

23  if  (izproc  >  0) 

24  next.toa  «  xnetWCl] .pass.toa; 

25  else 

26  nezt.toa  =  znetNH[l] .pass.toa; 

27  pass.toa  »  nezt.toa; 

28  diff  »  toa  -  nezt.toa; 

29  hist.bin  »  (plural  int)  p.f loor(diff  /  PRIspacing) ; 

30  destInRange  »  FALSE; 


UNCLASSIFIED 


19 


ERL-0692  RR 


UNCLASSIFIED 


31  to.send  »  0; 

32  if  (hist.bin  <  nproc)  { 

33  dastInRange  »  TRUE; 

34  to.send  »  1; 

35  } 

36  all  < 

37  ny.hlst  sandsithAddl6u(to_aend,  hist.bin); 

38  hist.bin  >  iproc; 

39  to.send  =  0; 

40  } 

41  } 

42  raturnCmy.hist) ; 

43  } 


1.2  SIMD  Method  2  —  vlcjnpl.h 

1  /*  Useful  MPL  declarations. 

2  Written  by  Vaughan  Clarkson,  April,  1991. 

3  Copyright  (c)  1991. 

4  •/ 

5  /*  Definitions  for  boolean  variables.  Should  be  an  enumerated  type  but 

6  presently  HPPE  can’t  cope  with  enums 

7  */ 

8  Sdefine  FALSE  0 

9  Sdefine  TRUE  (! FALSE) 

10  /*  This  awkward  construction  overcomes  the  limitation  of  the  natural 

11  HPL  while  statement  by  removing  the  restriction  that  the  active  set 

12  is  non-increasing 

13  */ 

14  Sdefine  FULLwhile(pvar)  while(reduca0r8u(pvar))  if  (pveo') 

15  /*  Wrap  around  xnet  so  that  PE  are  arranged  linearly  */ 

16  Sdefine  znetLEFT(pvar)  (ixproc  >  0  7  (xnetWCl] .pvar)  :  (xnetWWCl] .pvar)) 

17  Sdefine  xnetRIGHT(pvar)  (ixproc  <  nxproe-1  ?  (xnotE[l] .pvar)  :  (xnetSECl] .pvar)) 

18  typedef  unsigned  char  bool; 

19  typedef  unsigned  short  procno; 


1.3  SIMO  Method  3  —  hlst.m 

1  Slnclude  <mpl.h> 

2  S include  <math.b> 

3  Sdefine  NULL  0 
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4  plural  int  histograoC/*  register  plural  double  •/  toa, 

5  /*  register  double  */  binWidth) 

6  register  plural  double  toa; 

7  register  double  binWidth; 

8  { 

9  plural  unsigned  short  myHist[4096] ; 

10  int  order; 

11  plural  double  nextTOA.  passTOA,  diff; 

12  plural  int  histBin; 

13  int  foundationHask.  upperHask,  lowerHask,  distance; 

14  int  numToSend,  upper,  loser; 

15  plural  int  foundation,  my Index; 

16  passTOA  =  toa; 

17  for  (order  =  0;  order  <  nproc  /  2  -  1;  ordar++)  { 

18  if  (ixproc  >  0) 

19  nextTOA  =  xnetWCl] .passTOA; 

20  else 

21  nextTOA  =  xnetSWCl] .passTOA; 

22  passTOA  =  nextTOA; 

23  diff  =  p.fabs(toa  -  nextTOA); 

24  histBin  =  (plural  int)  p.floor(diff  /  binWidth) ; 

25  if  (histBin  <  nproc) 

26  myHist [histBin] 

27  > 

28  /*  And  again  for  the  final  iteration,  though  now  with  just  half  the  PEs  */ 

29  if  (iproc  <  nproc  /  2)  { 

30  if  (ixproc  >  0) 

31  nextTOA  xnetWCl]  .passTOA; 

32  else 

33  nextTOA  >  xnetHWCl] .passTOA; 

34  passTOA  =  nextTOA; 

35  diff  =  p.fabs(toa  -  nextTOA); 

36  histBin  »  (plural  int)  p_floor(diff  !  binWidth) : 

37  if  (histBin  <  nproc) 

38  myHist [histBin] +♦ ; 

39  > 

40  /♦  How  redistribute  the  histograms  */ 

41  /*  Assumptions:  nxproc  ==  nyproc  and  nproc  is  an  Integer  power  of  4  •/ 

42  /*  Can  be  generalised.  */ 

43  foundationHask  =  nxproc  I  1; 

44  upperHask  =  nxproc  «  1; 

45  lowerHask  =  2; 

46  distance  ^  1; 

47  for  (numToSend  >  nproc  »  2;  numToSend  >  0;  numToSend  »>  2)  { 

48  foundation  >  iproc  t  foundationHask; 

49  for  (upper  ■  0;  upper  <  nproc;  upper  +■  upperHask) 

50  for  (lower  *  6;  lower  <  nxproc;  lower  lowerHask)  •[ 

51  plural  unsigned  short  dummyE,  dummyS.  dummySE; 

52  plural  int  indexE.  IndexS.  indexSE; 
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53  my Index  «  foundation  I  upper  I  lower; 

54  IndexE  >  xnetWCdistance] .mylndex; 

55  indezS  >  znetNCdiatance] .mylndex; 

56  indexSE  >  xnetHW Cdlatance] .mylndex; 

57  dummyE  ■  xnetECdiatance] .myHlatLlndexE] ; 

58  dummyS  «  xnetSCdiatance] .myHiatCindexS] ; 

59  dummySE  >  znetSECdlatance] .myHiat[indexSE] : 

60  myHlat [mylndex]  ♦»  dummyE  ♦  dummyS  ♦  dummySE; 

61  > 

62  foundatlonMaak  1°  foundationMaak  «  1; 

63  upperMask  <<>  1; 

64  lowerHask  1; 

65  distance  «>  1; 

66  } 

67  retumCmyHist  [iproc] ) ; 

68  } 


1.4  Vector/SISD  Fortran  code 

1  subroutine  calcHistCtoa,  hist,  copy,  diff,  index) 

2  real*8  toa(») ,  copy(»)  ,  diffC*) 

3  integer  hist (*) ,  index(*) 

4  c 

5  c  Calculate  the  histogram  from  the  times  of  arrival  (TOAs)  in  toa 

6  c  and  place  the  results  in  hist.  The  arrays  copy,  diff  and  index 

7  c  are  scratch  arrays.  All  arrays  except  hist  must  be  at  least  of 

8  c  dimension  numtoa,  and  hist  must  have  a  minimum  dimension  of  numhist, 

9  c 

10  common  /histparams/  numtoa,  numhist,  binwidth 

11  integer  numtoa,  numhist 

12  real*8  binwidth 

13  integer  i,  j 

14  c  Assume  even  number  of  toas 

15  do  200  i  >  1,  numtoa  /  2  -  1 

16  do  300  J  »  1,  numtoa 

17  if  (j  .le.  i)  than 

18  copy(j)  »  toaCnumtoa  -  i  ♦  j) 

19  else 

20  copy(j)  «  toaCj  -  i) 

21  endif 

22  diff(j)  =  abs(toa(j)  -  copy(j)) 

23  index(J)  ■  int(diff{j)  /  binwidth)  +  1 

24  if  (indez(j)  .la.  numhist)  then 

26  hist(index(J))  *  hist(index(J))  ♦  1 

26  endif 

27  300  continue 

28  200  continue 

29  do  400  J  1,  numtoa  /  2 
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30 

31 

32 

33 

34 


difl(j)  •  abs(toa(J  ♦  nuntoa  12)-  toa(j)) 
indez(j)  «  iiit(diff(j)  /  binvidth)  ♦  1 
if  (indaz(j)  .le.  nuahist)  then 

hist(lndez(j))  *  hist(indez(j))  *  1 
end  if 
400  continue 
end 
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